Presentation


Abstract

Leveraging Generative AI to Enhance Assessment Validity and Engagement

Assessment validity in high-stakes Laboratory Science courses is often compromised by limited, familiar question pools that encourage rote memorization over conceptual mastery. This project implemented Generative AI (GenAI) models to rapidly produce thousands of novel, content-specific assessment questions, with prompt designs explicitly targeting specific course material and Bloom’s Taxonomy levels. The intervention was deployed across two distinct courses, utilizing the questions in both low-stakes (formative, unlimited attempts) and high-stakes (summative, proctored) environments. Preliminary data showed a higher percentage of correct answers on low-stakes self-assessments than on high-stakes summative exams, suggesting GenAI’s utility in managing assessment security and differentiating learning stages. GenAI significantly increased assessment throughput and student engagement by maintaining question freshness and eliminating reliance on shared test banks. These preliminary findings suggest that question difficulty can be modulated via Bloom’s Taxonomy prompting and that GenAI is a scalable method for creating rigorous, accurate assessments, yielding a clearer perspective on true student progression and mastery.

Paper

NOTE TO READER

The following is a draft.

INTRODUCTION

Assessment is a fundamental driver of student learning in higher education, particularly within the health professions where the demonstration of clinical competence is paramount. Among the various assessment modalities, Multiple-Choice Questions (MCQs) remain the most ubiquitous format due to their efficiency in grading, objectivity, and capacity to cover broad curriculum content in a short timeframe (Zaidi et al., 2018; Scully, 2017). When constructed effectively, MCQs can support learner engagement and provide critical formative feedback that bridges the gap between didactic instruction and clinical application. However, the widespread reliance on MCQs is accompanied by persistent concerns regarding item quality, cognitive depth, and the significant resource burdens associated with maintaining valid item banks.

A primary pedagogical criticism of MCQs is their tendency to target the lower levels of Bloom’s Taxonomy - specifically rote memorization and recall - rather than the analysis, synthesis, or evaluation required for professional practice (Zahoor et al., 2023; Middlemas & Hensal, 2009). While faculty often aim to assess critical thinking, research suggests a disconnect between educator intent and item quality; questions intended to measure higher-order thinking are frequently perceived by students as tests of rote memory. This misalignment encourages surface learning strategies, such as “cramming,” rather than the development of deep conceptual schemas (Azer, 2003; Zaidi et al., 2018).

To enhance validity, educators are encouraged to utilize scenario-based stems that require the application of knowledge. However, constructing high-quality, vignette-based MCQs is a resource-intensive and technically difficult task (Hurtz et al., 2012). This challenge is compounded by the necessity of “refreshing” item banks to mitigate “item decay” - the phenomenon where questions become easier over time due to exposure (Joncas et al., 2018; Naidoo, 2023). The “item-writing bottleneck” creates a tension between the need for frequent, engaging formative assessments and the limited time faculty have to produce them (Ali et al., 2018).

The advent of Generative AI and Large Language Models (LLMs) offers a potential solution to these historical limitations. Unlike traditional template-based generation, modern LLMs utilize Natural Language Processing (NLP) to interpret complex concepts and generate human-like text, acting as a force multiplier for educators (Al Shuraiqi et al., 2024; Karahan & Emekli, 2025). Recent literature suggests that AI tools can produce hundreds of items in minutes, potentially democratizing the creation of case-based assessments that mirror authentic clinical challenges (Laupichler et al., 2024). By facilitating the rapid creation of diverse, personalized practice questions, AI has the potential to support “test-enhanced learning,” transforming assessment from a passive measurement tool into an active engagement strategy (Indran et al., 2024).

However, the efficacy of AI in enhancing assessment validity remains under scrutiny. While AI offers speed, questions persist regarding its ability to reliably generate items at specific cognitive levels. Some studies indicate that AI-generated questions may be statistically easier or less discriminating than human-authored items, potentially failing to challenge high-performing students if not carefully prompted and reviewed (Kaya et al., 2025; Laupichler et al., 2024). Consequently, there is a need to rigorously evaluate whether AI can be directed to produce items that align with higher-order cognitive domains and whether the availability of these resources translates into tangible student engagement.

To address these gaps, this study investigates the utility of Generative AI within an educational curriculum. Specifically, this research aims to assess the difficulty of AI-generated MCQs in relation to a prompted Bloom’s taxonomy level, while simultaneously evaluating course engagement through an analysis of the number of assessment attempts completed by students.

MATERIALS AND METHODS

This study utilized a retrospective exploratory analysis to evaluate the efficacy of AI-generated assessment questions within a health professions curriculum. The study was conducted at an academic institution within a Medical Laboratory Science (MLS) program accredited by the National Accrediting Agency for Clinical Laboratory Sciences (NAACLS). Data were collected over one academic semester (Fall 2025). The assessment questions evaluated were integrated into the undergraduate and graduate curricula, specifically targeting two core courses: Clinical Hematology Lecture and Clinical Hematology Laboratory.

The study analyzed two distinct units: the psychometric properties of AI-generated items (Aim 1) and the assessment behaviors of the student cohort (Aim 2). The cohort consisted of 13 students (N = 7 Bachelor, N = 6 Master) enrolled concurrently in both courses. Inclusion criteria for the final analysis comprised all completed assessment attempts (quizzes and exams) containing the AI-generated questions and recorded during the study period. Attempts marked as “incomplete,” invalid based on completion duration, or data pertaining to students who withdrew from the program prior to course completion were excluded.

Assessment questions were generated using Google’s Gemini models (versions gemini-2.5-pro and gemini-3-pro-preview), accessed through Google’s AI Studio. To generate psychometrically sound assessment items, a structured prompt engineering strategy was employed (see Supplementary Appendix A for full text). The AI was assigned the persona of an expert Medical Laboratory Science assessor and directed via Chain-of-Thought prompting (Wei et al., 2022) to generate items across four specific levels of Bloom’s Taxonomy: Remember, Understand, Apply, and Analyze. The system instructions explicitly defined the required cognitive task for each level (e.g., “recall specific facts” vs. “differentiate relationships”) and provided logic for generating plausible distractors (e.g., “common clinical misconceptions”).

Following an initial quality check (“Stage 1”) that revealed a tendency for the AI to generate significantly longer text for correct answers - a common cueing flaw - a “Stage 2” protocol was implemented. This protocol enforced a “Concise Truth” rule constraining correct answers to be direct and concise, and an “Adjective Loading” rule to artificially lengthen distractors. The final output required the question stem, four options, comprehensive remedial feedback for all choices, and an AI-predicted difficulty rating (1 = easy to 10 = hard).
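As a practical aid during human review, the Stage 2 length rules lend themselves to an automated check. The sketch below is a hypothetical validator - the function name, option format, and violation messages are illustrative, not part of the study's actual pipeline:

```python
def check_length_rules(options, correct_index=0, max_words=10):
    """Flag Stage 2 length-rule violations for one MCQ.

    options: list of four answer-option strings; correct_index marks the
    correct answer (hypothetical layout). Returns a list of violations.
    """
    violations = []
    word_counts = [len(opt.split()) for opt in options]
    # "Concise Truth": the correct answer must stay within the word cap.
    if word_counts[correct_index] > max_words:
        violations.append(f"correct answer exceeds {max_words} words")
    # Forbidden pattern: the correct answer must never be the longest
    # option (ties with the longest are flagged conservatively).
    if word_counts[correct_index] == max(word_counts):
        violations.append("correct answer is the longest option")
    return violations
```

An item whose correct answer is short and whose distractors are longer passes with an empty list; an item whose correct answer is the longest option is flagged for manual revision.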

The source material ingested by the AI included lecture PowerPoint presentations (Clinical Hematology Lecture) and Laboratory Investigation procedures (Clinical Hematology Laboratory). To ensure validity, the generated items underwent a “human-in-the-loop” review process by one board-certified MLS with a Specialist in Hematology credential and 13 years of Clinical Hematology experience. Items were vetted for content accuracy, distractor plausibility, grammatical syntax, and alignment with the ingested source material. Stage 1 generated questions (N = 4,019) and Stage 2 generated questions (N = 5,807) were added to question pools associated with specific learning modules corresponding to the weekly curriculum.

Assessments were administered via the Desire2Learn (D2L) Learning Management System. The assessment structure was divided into low-stakes formative quizzes and high-stakes summative examinations. Low-stakes assessments were available at the start of a learning module and consisted of unproctored quizzes (5-16 questions). Students were permitted unlimited attempts to encourage mastery, with the highest-scoring attempt recorded in the gradebook. High-stakes assessments were administered at fixed dates and times and consisted of a single proctored attempt for module examinations (25-40 questions) and course final examinations (100 questions).

Each assessment attempt, including any subsequent attempts, was designed to deliver a random set of questions from the pool of AI-generated questions specific to the module content being assessed. For all assessments, a total time limit was set to allow 1.5 minutes per question. To facilitate learning, students received immediate feedback upon completion, including the student’s response, the correct response, and an AI-generated explanation of the rationale. Participation in the assessments was a mandatory component of the coursework.
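The delivery rules above (a fresh random draw from the module pool for every attempt, with a total time limit of 1.5 minutes per question) can be sketched in a few lines; the function name and pool representation here are hypothetical:

```python
import random

def build_attempt(question_pool, n_questions, minutes_per_question=1.5):
    """Assemble one assessment attempt, mirroring the delivery rules
    described above: draw a random, non-repeating set of questions from
    the module's pool and compute the attempt's total time limit."""
    questions = random.sample(question_pool, n_questions)  # no repeats within an attempt
    time_limit_min = n_questions * minutes_per_question
    return questions, time_limit_min
```

For example, a 10-question quiz drawn from a module pool would receive a 15-minute time limit, while each new attempt draws a different random subset.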

Data were extracted directly from the D2L platform using the export features within the “Grade” option of assessments. The study focused on two primary aims: assessment of difficulty and assessment of engagement. Difficulty was calculated as the aggregate performance on all questions, stratified by each question’s associated Bloom’s level. Because students could encounter the same question more than once across attempts, only each student’s first attempt at a given question was included. Engagement was defined as the total number of assessment attempts completed by students beyond the single mandatory attempt.
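A minimal sketch of the difficulty calculation is shown below. The record fields (`student`, `item`, `bloom`, `correct`) are hypothetical names for illustration; the key step is that only each student's first exposure to an item is counted before aggregating by Bloom's level:

```python
from collections import defaultdict

def difficulty_by_bloom(attempts):
    """attempts: iterable of dicts with keys 'student', 'item', 'bloom',
    and 'correct' (bool), in chronological order. Keeps only each
    student's first exposure to each item, then returns the proportion
    of correct responses per Bloom's level."""
    seen = set()
    totals = defaultdict(int)
    corrects = defaultdict(int)
    for a in attempts:
        key = (a["student"], a["item"])
        if key in seen:
            continue  # repeat exposure: excluded, as in the study design
        seen.add(key)
        totals[a["bloom"]] += 1
        corrects[a["bloom"]] += a["correct"]  # bool counts as 0/1
    return {level: corrects[level] / totals[level] for level in totals}
```

Engagement would be computed analogously by counting each student's attempts in excess of the one mandatory attempt per assessment.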

Data cleaning was performed and statistical analyses were conducted using R (version 4.5.2). Descriptive statistics were calculated to characterize the AI-generated question pool.

To validate the efficacy of the prompt engineering interventions, two specific analyses were performed on the generated items. First, a Mann-Whitney U test was conducted to compare the word counts of correct answer options between Stage 1 (unconstrained) and Stage 2 (constrained). Second, a Chi-squared test of independence was used to compare the proportion of items in which the correct answer was the longest option between Stage 1 and Stage 2, to assess the reduction of length-based cueing.
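The 2×2 Chi-squared comparison can be reproduced directly from the longest-option counts reported in the Results. The sketch below applies Yates' continuity correction (the default in R's chisq.test for 2×2 tables) and uses the complementary error function for the df = 1 p-value:

```python
import math

def chi2_2x2_yates(a, b, c, d):
    """Yates-corrected chi-squared test of independence for the 2x2
    table [[a, b], [c, d]]; returns (chi2, p) with df = 1."""
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected counts from the row and column marginal totals.
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    chi2 = sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))
    p = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival, exact for df = 1
    return chi2, p

# Stage 1 vs Stage 2 counts from the Results: [not longest, longest]
chi2, p = chi2_2x2_yates(2381, 1638, 3543, 2264)
```

Running this on the reported counts reproduces the statistic given in the Results (chi2 ≈ 3.03, p ≈ .08).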

To evaluate Aim 1 (Difficulty), difficulty was operationalized as the proportion of correct student responses. A Chi-squared Test for Trend in Proportions was conducted to determine if student success rates significantly decreased as the prompted Bloom’s Taxonomy level increased (ordinal trend). Subsequent pairwise comparisons between specific Bloom’s levels were analyzed using Chi-squared tests of independence with Bonferroni correction to adjust for multiple comparisons. Additionally, Spearman’s Rank Correlation was used to assess the relationship between the AI-predicted difficulty rating (1 - 10) and actual student performance.
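The Chi-squared Test for Trend in Proportions (R's prop.trend.test) has a compact closed form, the Cochran-Armitage statistic. The sketch below implements it for counts of correct responses across ordered groups; the study's per-level attempt counts are not reproduced here, so any inputs are illustrative:

```python
import math

def chi2_trend(successes, totals, scores=None):
    """Cochran-Armitage test for trend in proportions across ordered
    groups; returns (chi2, p) with df = 1. Algebraically equivalent to
    R's prop.trend.test with default integer scores."""
    if scores is None:
        scores = list(range(1, len(successes) + 1))
    n = sum(totals)
    p_bar = sum(successes) / n                                   # pooled proportion
    s_bar = sum(t * s for t, s in zip(totals, scores)) / n       # weighted mean score
    num = sum(x * (s - s_bar) for x, s in zip(successes, scores)) ** 2
    den = p_bar * (1 - p_bar) * sum(t * (s - s_bar) ** 2
                                    for t, s in zip(totals, scores))
    chi2 = num / den
    p = math.erfc(math.sqrt(chi2 / 2))  # df = 1
    return chi2, p
```

With equal proportions across groups the statistic is zero; a monotone decline in the proportion correct across Bloom's levels drives the statistic upward, which is the pattern the study tested for.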

To evaluate Aim 2 (Engagement), descriptive statistics (median and interquartile range) were generated for the frequency distribution of attempts. Due to the small sample size (N = 13), non-parametric Spearman’s Rank Correlation was used to examine relationships between voluntary engagement frequency and final exam performance (Z-score). A p-value of < 0.05 was used to determine statistical significance.
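Spearman's Rank Correlation, used in both aims, is Pearson's correlation computed on average ranks. A self-contained sketch (a simplified stand-in for the R analysis, with ties assigned their mean rank):

```python
import math

def _ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Applied to, say, attempt counts and final-grade Z-scores per student, a perfectly monotone relationship yields rho = 1 regardless of the raw scale, which is why the rank-based statistic suits the small, skewed engagement data.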

RESULTS

Validation of Prompt Engineering

A total of 9,826 valid items were generated and injected into the course curriculum. This pool comprised 4,019 items generated during Stage 1 (Unconstrained) and 5,807 items generated during Stage 2 (Constrained/Engineered).

To mitigate the tendency of Generative AI to produce conspicuously long correct answers, a “Concise Truth” rule was implemented in Stage 2. Descriptive statistics revealed a marked reduction in the length of correct answer options between stages. The median word count for correct answers decreased from 14 (IQR = 7-20) in Stage 1 to 5 (IQR = 2-8) in Stage 2. Similarly, character counts decreased from a median of 95 (IQR = 53-129) to 34 (IQR = 19-52). A Mann-Whitney U test indicated this reduction was statistically significant (p < .001). This significant reduction in word count was observed consistently across all four prompted Bloom’s Taxonomy levels (p < .001 for all levels).

To address the tendency for the correct answer to be the longest option, an “Adjective Loading” rule was introduced in Stage 2 distractors. In Stage 1, the correct answer was the longest option in 40.8% of items. Following the rule’s application in Stage 2, this frequency decreased by 1.8 percentage points to 39.0%. A Chi-squared test indicated no statistically significant difference between the stages in the frequency of the correct answer being the longest option (χ2 = 3.031, df = 1, p = .082).

Stage     Correct Answer Not Longest     Correct Answer Longest
Stage 1   2381                           1638
Stage 2   3543                           2264
Aim 1: Assessment of Difficulty

The total item pool was evenly distributed across the targeted Bloom’s Taxonomy levels: Remember (24.9%), Understand (25.3%), Apply (24.8%), and Analyze (25.0%). Of the 9,826 items generated, students were exposed to 5,748 (58.5%) unique items. Stage 1 exposure (N = 3,255) had slightly more items at the Remember level (N = 883, 27.1%) than at the other levels - Understand (N = 789, 24.2%), Apply (N = 791, 24.3%), and Analyze (N = 792, 24.3%). Stage 2 exposure (N = 2,493) had a similar distribution: Remember (N = 737, 29.6%), Understand (N = 578, 23.2%), Apply (N = 582, 23.3%), and Analyze (N = 596, 23.9%).

Aggregate student performance demonstrated a significant inverse relationship between the prompted Bloom’s Taxonomy level and the proportion of correct responses. As the cognitive level increased, the percentage of correct responses decreased: Remember (80.7%), Understand (78.6%), Apply (75.8%), and Analyze (73.2%). A Chi-squared Test for Trend in Proportions confirmed this trend was statistically significant (χ2 = 70.556, df = 1, p < .001).

Post-hoc pairwise comparisons revealed significant differences in difficulty between the Remember level and both Apply (p < .001) and Analyze (p < .001) levels. Items at the Understand level were significantly different than those at the Analyze level (p < .001) and Apply level (p = .03). However, the difference between Apply and Analyze did not reach statistical significance (p = .08).

When stratified by course, items generated for the Clinical Hematology Laboratory were significantly more difficult (75.2% correct) than those for the Clinical Hematology Lecture (78.7% correct; χ2 = 24.948, df = 1, p < .001). Lecture items followed the aggregate trend, with difficulty significantly increasing across Bloom’s levels (χ2 = 77.77, df = 1, p < .001). Laboratory items also demonstrated a significant trend (χ2 = 8.228, df = 1, p = .004); however, pairwise comparisons indicated a significant difference only between the Understand and Analyze levels (p = .02).

For Stage 2 items, the AI provided a predicted difficulty rating (1–10). Spearman’s Rank Correlation showed a strong positive correlation between the AI’s predicted rating and the prompted Bloom’s level (ρ = .815, p < .001).

Furthermore, a Chi-squared Test for Trend confirmed that actual student performance significantly declined as the AI-predicted difficulty rating increased (χ2 = 15.839, df = 5, p = .007), demonstrating concordance between AI prediction and student performance. To satisfy the expected-count requirements of this analysis, items at difficulty ratings one and two were grouped, as were ratings seven and greater.

Difficulty Exposed (N) Correct (N)
1 69 66
2 899 698
3 613 485
4 747 563
5 817 607
6 427 316
7 152 101
8 13 11
9 4 3
10 1 1
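For reference, the exposure table above can be collapsed into the observed proportion correct at each AI-predicted difficulty rating; the counts below are copied exactly from the table:

```python
# (difficulty rating, items exposed, items answered correctly),
# taken directly from the exposure table above
table = [(1, 69, 66), (2, 899, 698), (3, 613, 485), (4, 747, 563),
         (5, 817, 607), (6, 427, 316), (7, 152, 101), (8, 13, 11),
         (9, 4, 3), (10, 1, 1)]

# Proportion of correct responses at each predicted difficulty rating
prop_correct = {rating: correct / exposed for rating, exposed, correct in table}
```

The proportions fall from roughly .96 at rating 1 to roughly .66 at rating 7; ratings 8 through 10 have too few exposures to yield stable estimates.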

Aim 2: Course Engagement

A total of 2,802 voluntary attempts (defined as attempts beyond the mandatory requirement) were recorded across the cohort (N = 13). The median number of extra attempts per student was 138 (IQR = 98-258), with individual engagement ranging from 44 to 753 total extra attempts. Engagement was higher in the Lecture course (N Attempts = 1,689; Median = 94, IQR = 55-162) compared to the Laboratory course (N Attempts = 1,113; Median = 58, IQR = 43-98).

There was a strong, positive correlation between the frequency of voluntary assessment attempts and overall course performance. Spearman’s Rank Correlation analysis revealed a significant relationship between the number of practice attempts and final course grades expressed as Z-scores (ρ = .638, p < .001).

DISCUSSION

This study sought to evaluate the efficacy of leveraging Generative AI to address the “item-writing bottleneck” in health professions education by assessing the validity, difficulty, and engagement potential of AI-generated MCQs. The results demonstrate that while LLMs can function as powerful “force multipliers” capable of generating vast item pools, they exhibit distinct limitations in psychometric precision and higher-order cognitive differentiation. The findings suggest that AI is highly effective at fostering student engagement through volume-based testing, but requires significant human intervention (“prompt engineering”) to mitigate inherent biases and validity threats.

A persistent criticism of AI-generated content is its tendency toward verbosity, which can introduce “test-wiseness” cues where the longest answer is frequently the correct one (Artsi et al., 2024). This study’s “Stage 2” intervention successfully addressed the verbosity of correct answers, achieving a statistically significant reduction in word count through the “Concise Truth” rule. However, the attempt to counterbalance this by artificially lengthening distractors (“Adjective Loading”) failed to produce a statistically significant reduction in cueing globally (p = .08). While subgroup analyses suggested efficacy at the Remember and Analyze levels, the lack of a consistent effect across all domains indicates that the AI prioritized semantic logic over rigid structural constraints. This finding aligns with recent literature suggesting that LLMs favor the “truth” and may resist generating plausible but incorrect distractors that match the complexity of the correct answer without extensive, iterative chain-of-thought prompting (Yao et al., 2025). Consequently, educators must remain vigilant during the review process, as algorithmic constraints alone may not fully eliminate psychometric flaws.

The analysis of item difficulty revealed a statistically significant, albeit shallow, gradient across Bloom’s Taxonomy. While the AI successfully made Remember items easier than Analyze items, the practical difference in student performance was narrow (7.5%), and the system failed to significantly differentiate between the Apply and Analyze levels (p = .08). This suggests a “Complexity Ceiling,” where the AI mimics the structure of higher-order questions (e.g., using clinical vignettes) but may not achieve the requisite increase in cognitive load.

This plateau mirrors findings by Law et al. (2025) and Al Shuraiqi et al. (2024), who noted that AI-generated items often test the retrieval of isolated facts embedded within a vignette rather than true clinical synthesis. The significant difference in difficulty between Lecture and Laboratory items further complicates this picture, suggesting that the prompt source material influences the model’s ability to generate difficulty. The strong correlation between the AI’s predicted difficulty rating and student performance is promising, yet it indicates the model is better at predicting how students will perform on its own logic than it is at constructing truly distinct cognitive challenges at the highest levels of the taxonomy.

The most robust finding of this study is the high volume of voluntary student engagement. With a median of 138 extra attempts per student, the availability of an unlimited, low-stakes AI question bank facilitated massive “test-enhanced learning.” The strong correlation (ρ = .638) between these voluntary attempts and final course performance supports the pedagogical value of frequent formative retrieval practice (Say et al., 2022; Mui Lim & Rodger, 2010; Steinel et al., 2022; Kulasegaram & Rangachari, 2018). However, these engagement data must be interpreted with extreme caution due to the small sample size (N=13). The correlation is susceptible to the influence of outliers - specifically, “super-users” who completed over 700 attempts. While the signal is positive, it is unclear if the benefit is derived from the quality of the AI items or simply the quantity of time-on-task. Furthermore, it cannot be ruled out that high-performing students are simply more motivated to use available resources, rather than the resources causing the high performance.

Several limitations constrain the generalizability of these findings. First, the disparity between the number of items generated (N=9,826) and the number of students (N=13) results in high statistical power for item analysis but low power for student outcome analysis. Second, because items were randomized, students were exposed to a subset (N=5,748) of the total pool; while randomization was assumed to be uniform, it is possible that the difficulty trends observed are artifacts of the specific items exposed rather than the entire generated corpus. Finally, the study was conducted at a single institution within a specific domain (Hematology). The “Concise Truth” and “Adjective Loading” rules may perform differently in domains requiring less descriptive nuance.

Generative AI represents a transformative tool for assessment creation, offering a solution to the labor-intensive demands of item writing. It succeeds in creating engaging, grammatically coherent content that scales to student demand. However, it is not yet a “set it and forget it” solution. The “Complexity Ceiling” and the persistence of distractor flaws highlight the necessity of a “human-in-the-loop” workflow. Educators must view AI as a drafter of content that requires expert refinement to ensure it assesses deep clinical reasoning rather than superficial recall. Future research should focus on optimizing prompts to break through the complexity ceiling and validating these findings across larger, multi-institutional cohorts.

REFERENCES

Zaidi NL, Grob KL, Monrad SM, Kurtz JB, Tai A, Ahmed AZ, Gruppen LD, Santen SA. Pushing critical thinking skills with multiple-choice questions: does Bloom’s taxonomy work?. Academic Medicine. 2018 Jun 1;93(6):856-9.

Scully D. Constructing multiple-choice items to measure higher-order thinking. Practical Assessment, Research and Evaluation (PARE). 2017;22(1):1-3.

Zahoor AW, Farooqui SI, Khan A, Kazmi SA, Qamar N, Rizvi J. Evaluation of Cognitive Domain in Objective Exam of Physiotherapy Teaching Program by Using Bloom’s Taxonomy. Journal of Health and Allied Sciences NU. 2023 Apr;13(02):289-93.

Middlemas DA, Hensal C. Issues in selecting methods of evaluating clinical competence in the health professions: implications for athletic training education. Athletic Training Education Journal. 2009 Jul 1;4(3):109-16.

Azer SA. Assessment in a problem‐based learning course: Twelve tips for constructing multiple choice questions that test students’ cognitive skills. Biochemistry and Molecular Biology Education. 2003 Nov;31(6):428-34.

Hurtz GM, Chinn RN, Barnhill GC, Hertz NR. Measuring clinical decision making: do key features problems measure higher level cognitive processes?. Evaluation & the health professions. 2012 Dec;35(4):396-415.

Joncas SX, St-Onge C, Bourque S, Farand P. Re-using questions in classroom-based assessment: an exploratory study at the undergraduate medical education level. Perspectives on medical education. 2018 Dec;7(6):373-8.

Naidoo M. The pearls and pitfalls of setting high-quality multiple choice questions for clinical medicine. South African Family Practice. 2023;65(3).

Ali K, Zahra D, Tredwin C, Mcilwaine C, Jones G. Use of progress testing in a UK dental therapy and hygiene educational program. Journal of dental education. 2018 Feb;82(2):130-6.

Al Shuraiqi S, Aal Abdulsalam A, Masters K, Zidoum H, AlZaabi A. Automatic generation of medical case-based multiple-choice questions (MCQs): a review of methodologies, applications, evaluation, and future directions. Big Data and Cognitive Computing. 2024 Oct 17;8(10):139.

Laupichler MC, Rother JF, Grunwald Kadow IC, Ahmadi S, Raupach T. Large language models in medical education: comparing ChatGPT-to human-generated exam questions. Academic Medicine. 2024 May;99(5):508-12.

Indran IR, Paranthaman P, Gupta N, Mustafa N. Twelve tips to leverage AI for efficient and effective medical question generation: a guide for educators using Chat GPT. Medical Teacher. 2024 Aug 2;46(8):1021-6.

Kaya M, Sonmez E, Halici A, Yildirim H, Coskun A. Comparison of AI-generated and clinician-designed multiple-choice questions in emergency medicine exam: a psychometric analysis. BMC Medical Education. 2025 Jul 1;25(1):949.

Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, Le QV, Zhou D. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems. 2022 Dec 6;35:24824-37.

Yao Z, Parashar A, Zhou H, Jang WS, Ouyang F, Yang Z, Yu H. MCQG-SRefine: multiple choice question generation and evaluation with iterative self-critique, correction, and comparison feedback. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) 2025 Apr (pp. 10728-10777).

Kulasegaram K, Rangachari PK. Beyond “formative”: assessments to enrich student learning. Advances in physiology education. 2018 Mar 1;42(1):5-14.

Prompts

Stage 1


System Instruction

You are a question generator to assess a Medical Laboratory Science student’s understanding of [insert topic]. Questions should be associated with a level of Bloom’s taxonomy. Each question should be multiple choice with four selections. Feedback should be included that discusses why a selection response is correct and the other selection responses are incorrect.

Instruction for question generation:
Unique questions must be generated for each prompt. Questions must be generated at the requested level of Bloom’s taxonomy:

  • Remember:
    • Cognitive Task: Define, identify, list, name, recall, recognize, state. The goal is to test the student’s ability to retrieve specific, rote information from memory.
      • Instructions:
        1. Cognitive Target: The question must only require the test-taker to recall a specific fact, term, or definition. It should not require any interpretation.
        2. Question Stem: Write a direct question (e.g., “What is…” “Which of the following…”) or a sentence-completion stem (e.g., “The capital of…”).
        3. Correct Answer: Provide the single, factually correct answer.
        4. Plausible Distractors: Generate three distractors that are categorically similar but factually incorrect. For example, if the correct answer is a person’s name, the distractors must also be names from the same context.
        5. Rationale: Rationale for Correct Answer (A): Explain why this is the correct, memorized fact. Rationale for Distractors (B, C, D): For each, explain why it is incorrect but plausible (e.g., “This is another type of tissue, but not a muscle tissue,” or “This was a scientist from the same era, but not the one who made this specific discovery.”)
  • Understand:
    • Cognitive Task: Explain, summarize, classify, compare, infer, paraphrase, give an example. The goal is to test if the student can do more than just recall - they must be able to interpret, summarize, paraphrase, or classify information. This is a significant step up from “Remember.”
    • Instructions:
      1. Cognitive Target: The question must require the test-taker to demonstrate comprehension, not just recall. This can be done by asking them to:
        • Identify the best summary of the concept.
        • Classify a new example.
        • Explain a concept in their own words (presented as options).
      2. Question Stem: Write a stem that asks for interpretation. (e.g., “Which of the following statements best summarizes…” or “Which of the following is the clearest example of…”).
      3. Correct Answer: Provide the correct answer. This option should be a correct paraphrase or application of the core concept.
      4. Plausible Distractors: Generate three distractors. These are critical:
        • One distractor should be a common misunderstanding of the concept.
        • One distractor can be a word-for-word definition of a different, but related, concept from the same domain.
        • One distractor can be overly simplistic or an illogical conclusion.
      5. Rationale: Rationale for Correct Answer (A): Explain why this option correctly interprets or classifies the concept. Rationale for Distractors (B, C, D): Explain the specific misunderstanding each distractor targets (e.g., “This option confuses ‘opportunity cost’ with ‘sunk cost,’ a common error.”)
  • Apply:
    • Cognitive Task: Execute, implement, solve, use, demonstrate, operate, apply. The goal is to test if the student can use a learned concept, rule, method, or principle in a new and concrete situation. The question must present a novel scenario that was not explicitly covered in the learning material.
    • Instructions:
      1. Cognitive Target: The question must require the test-taker to apply a known concept to solve a problem or predict an outcome in a novel scenario.
      2. Scenario: Present a brief, new “mini-case study” or problem.
      3. Question Stem: Write a question that asks, “What would be the correct application…” or “How would you solve…” or “What outcome would you expect…”.
      4. Correct Answer: Provide the correct answer, which represents the proper application of the concept to the scenario’s variables.
      5. Plausible Distractors: Generate three distractors that represent common application errors:
        • One distractor should be a misapplication of the correct concept (e.g., using the right rule but in the wrong way).
        • One distractor should be the correct application of a different, related concept.
        • One distractor should be a common-sense “solution” that ignores the specific concept being tested.
      6. Rationale:
        • Rationale for Correct Answer (A): Explain how this option correctly applies the concept to the novel scenario.
        • Rationale for Distractors (B, C, D): Explain the specific error in reasoning each distractor is designed to catch.
  • Analyze:
    • Cognitive Task: Differentiate, organize, relate, compare, contrast, distinguish, categorize, analyze, find patterns. The goal is to test if the student can break down material into its constituent parts and detect the relationships between them. This includes identifying motives, causes, unstated assumptions, or the structure of an argument.
    • Instructions:
      1. Cognitive Target: The question must require the test-taker to analyze the provided material. This means they must, for example, identify an unstated assumption, differentiate fact from opinion, identify a logical fallacy, or determine the relationship between two variables in the data.
      2. Question Stem: Write a question that targets these relationships. (e.g., “What is the primary assumption the author makes…” or “The data in the chart suggests which of the following relationships…” or “The customer’s complaint is primarily caused by a breakdown in which process…”).
      3. Correct Answer: Provide the correct answer, which is a logical and accurate analysis of the provided material.
      4. Plausible Distractors: Generate three distractors that represent plausible but flawed analysis:
        • One distractor should be an interpretation that is too superficial (e.g., a “Remember” level observation).
        • One distractor should be a logical over-extrapolation or exaggeration of the material.
        • One distractor should be a plausible but unsupported analysis (i.e., it cannot be proven by the material provided).
      5. Rationale:
        • Rationale for Correct Answer (A): Explain why this analysis is the most accurate and best supported by the provided material.
        • Rationale for Distractors (B, C, D): Explain the specific flaw in analysis that each distractor represents.

Output format

Question Text: (insert question text)
A) insert BEST/CORRECT answer
B) insert DISTRACTOR/PLAUSIBLE FOIL
C) insert DISTRACTOR/PLAUSIBLE FOIL
D) insert DISTRACTOR/PLAUSIBLE FOIL

Feedback:
A) Incorrect:/Correct: insert clear and thorough reasoning
B) Incorrect:/Correct: insert clear and thorough reasoning
C) Incorrect:/Correct: insert clear and thorough reasoning
D) Incorrect:/Correct: insert clear and thorough reasoning

Difficulty Level: (provide an objective level of difficulty for the question on a scale of 1 (easy) to 10 (expert))
Difficulty Reason: (thorough/elaborate reasoning to difficulty judgement)

Prompt

5 questions at taxonomy level [insert Bloom level]

Stage 2

System Instruction

[Insert Stage 1 System Instruction with the following appended]

Instruction for option lengths (STRICT ENFORCEMENT):

  • Word Count Limit: Maximum 10 words per option.
  • The “Concise Truth” Rule: The Correct Answer must be the most direct, concise statement possible. Remove all unnecessary articles (‘a’, ‘the’), qualifiers (‘typically’, ‘usually’), or softening language. The Correct Answer MUST not be the longest option.
  • The “Wordy Falsehood” Rule: You must artificially lengthen the Distractors using “Adjective Loading.” Add plausible-sounding but irrelevant adjectives, specific chemical names, or physiological conditions to the distractors to increase their word count/character length.
  • Length Hierarchy (CRITICAL STRICTEST ENFORCEMENT):
    1. Option A (Correct) -> MUST NOT be highest word count or character length.
    2. Option B (Distractor) -> MUST be Medium word count or character length.
    3. Option C (Distractor) -> MUST be Medium word count or character length.
    4. Option D (Distractor) -> MUST be the highest word count or character length.
  • Forbidden Pattern: Under no circumstances should the Correct Answer possess the highest word count or character count.

Output format

Difficulty Level: (provide an objective level of difficulty for the question on a scale of 1 (easy) to 10 (expert))
Difficulty Reason: (thorough/elaborate reasoning to difficulty judgement)

Prompt

5 questions at taxonomy level [insert Bloom level]

Questions

Questions were additionally generated and incorporated into another teaching tool that helps prepare students for certification exams. Representative questions are provided below:

Representative Questions